6.4 LWS-Det: Layer-Wise Search for 1-bit Detectors
FIGURE 6.9
Example layer-wise feature map distributions and detection results of (a) a real-valued detector, (b) LWS-Det, and (c) BiDet. We extract the feature maps of the first, second, and final binarized layers and illustrate their distributions as frequency-value histograms in rows 1–3. The last row shows the detection results.
Figure 6.9 shows the layer-wise feature map distributions and detection results of a real-valued detector, our LWS-Det, and BiDet [240], from left to right. The first three rows show the distributions of the feature maps. The distribution of BiDet's feature maps differs more in variance from that of the real-valued detector, leading to false positives and missed detections in the fourth row. In comparison, our LWS-Det reduces the binarization error and provides better detection results.
In this section, we present a layer-wise search method that produces an optimized 1-bit detector (LWS-Det) [264], using the student-teacher framework to narrow the performance gap. As shown in Fig. 6.10, we minimize the binarization error by decoupling it into angular and amplitude errors. We search for the binarized weights under a differentiable binarization search (DBS) framework, following the DARTS method [151, 305], supervised by well-designed losses between the real-valued convolution and the 1-bit convolution. We formulate binarization as selecting a combination of −1 and +1 for each weight, so that a differentiable search can explore the binary space and significantly improve the capacity of 1-bit detectors. To improve the representation ability of LWS-Det, we design two losses that supervise each 1-bit convolution layer from the angular and amplitude perspectives. In this way, we obtain a powerful 1-bit detector (LWS-Det) that minimizes the angular and amplitude errors within a single framework.
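To make the DBS idea concrete, the following is a minimal sketch rather than the authors' implementation: it assumes a DARTS-style relaxation in which every weight entry carries two learnable logits over the candidate values {−1, +1}, a per-channel scale models the amplitude, and the layer-wise supervision combines an angular (cosine) term with an amplitude (magnitude) term between the real-valued teacher convolution and the 1-bit student convolution. The class and function names are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DBSConv2d(nn.Module):
    """Sketch of differentiable binarization search for one 1-bit conv layer.

    Each weight entry holds two logits over the candidates {-1, +1}; during
    search the effective weight is their softmax-weighted mixture (a
    DARTS-style relaxation), scaled per output channel by alpha.
    """
    def __init__(self, real_conv: nn.Conv2d):
        super().__init__()
        self.stride, self.padding = real_conv.stride, real_conv.padding
        # Architecture parameters: one pair of logits per weight entry.
        self.logits = nn.Parameter(torch.zeros(*real_conv.weight.shape, 2))
        # Amplitude (scale) parameter, one per output channel.
        self.alpha = nn.Parameter(torch.ones(real_conv.out_channels, 1, 1, 1))
        self.register_buffer("candidates", torch.tensor([-1.0, 1.0]))

    def effective_weight(self):
        probs = F.softmax(self.logits, dim=-1)          # (Cout, Cin, K, K, 2)
        w_bin = (probs * self.candidates).sum(dim=-1)   # soft {-1, +1} choice
        return self.alpha * w_bin

    def forward(self, x):
        return F.conv2d(x, self.effective_weight(),
                        stride=self.stride, padding=self.padding)

def layer_wise_loss(student_out, teacher_out, eps=1e-8):
    """Angular loss (cosine distance) plus amplitude loss (magnitude MSE)."""
    s, t = student_out.flatten(1), teacher_out.flatten(1)
    angular = (1.0 - F.cosine_similarity(s, t, dim=1, eps=eps)).mean()
    amplitude = F.mse_loss(s.norm(dim=1), t.norm(dim=1))
    return angular + amplitude
```

After the search converges, each weight would be fixed to the candidate with the larger logit, yielding a discrete {−1, +1} weight, while the learned per-channel scale retains the amplitude information.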
6.4.1 Preliminaries
Given a conventional CNN model, we denote by $w_i \in \mathbb{R}^{n_i}$ and $a_i \in \mathbb{R}^{m_i}$ its weights and feature maps in the $i$-th layer, where $n_i = C_i \cdot C_{i-1} \cdot K_i \cdot K_i$ and $m_i = C_i \cdot W_i \cdot H_i$. $C_i$ represents the number of output channels of the $i$-th layer, $(W_i, H_i)$ are the width and height of the feature maps, and $K_i$ is the kernel size. Then we have
$$a_i = a_{i-1} \otimes w_i, \qquad (6.65)$$
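As a concrete illustration of these shapes, the following snippet realizes Eq. (6.65) for a single layer; the values of $C_{i-1}$, $C_i$, $K_i$, $W_i$, and $H_i$ are arbitrary examples, not taken from any particular detector.

```python
import torch
import torch.nn.functional as F

# Example values for the i-th layer (illustrative only).
C_prev, C_i, K_i, W_i, H_i = 64, 128, 3, 56, 56

a_prev = torch.randn(1, C_prev, H_i, W_i)        # a_{i-1}
w_i = torch.randn(C_i, C_prev, K_i, K_i)         # n_i = C_i * C_{i-1} * K_i * K_i weights
a_i = F.conv2d(a_prev, w_i, padding=K_i // 2)    # a_i = a_{i-1} (conv) w_i, Eq. (6.65)

assert a_i.shape == (1, C_i, H_i, W_i)           # m_i = C_i * W_i * H_i values per image
```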